1 Set-up

2 Define functions

# Function to create symlinks from the original data files on /active into the
# current working directory of the R project.
mk_symlinks <- function(linked_dirname, filepaths) {
    dir.create(linked_dirname, recursive = TRUE, showWarnings = FALSE)
    lapply(filepaths, function(file) {
        target <- file.path(linked_dirname, basename(file))
        if (!file.exists(target)) {
            command <- glue::glue("ln -svf '{file}' '{target}'")
            system(command)
        }
    })
}
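
For reference, a shell-only sketch of what this function does for each file. The directory and file names here are illustrative stand-ins, not paths from the project:

```shell
# Stand-in source directory and files; in practice these would live on /active.
src_dir=$(mktemp -d)
touch "$src_dir/sample1.fastq.gz" "$src_dir/sample2.fastq.gz"

mkdir -p linked_data
for file in "$src_dir"/*.fastq.gz; do
    target="linked_data/$(basename "$file")"
    # -s symbolic link, -v verbose, -f replace an existing link
    [ -e "$target" ] || ln -svf "$file" "$target"
done
```

This avoids copying large fastq files into the project directory while keeping them addressable by relative paths.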

3 Background Information

SEACR Methodology:

Library Prep Protocol:

4 About the Pipeline

The pipeline performs quality trimming of the reads with Trim Galore, Bowtie2 alignment, SEACR peak calling, and optionally MACS2 peak calling. It also generates general QC statistics on the fastqs with FastQC and on the alignments. Finally, the QC reports are collected into a single report using MultiQC.

A DAG (directed acyclic graph) of the workflow is shown below:

5 Activate the Environment

Optional: use tmux on the Cybertron login nodes. Name the session nextflow, then request an interactive session before activating the nextflow conda environment. The project codes can be found with the project info command.

Change the QUEUE and NAME variables in the code chunk below to be accurate for your Cybertron projects.

tmux new-session -s nextflow
NAME="RSC_adhoc"
QUEUE="sceaq"
qsub -I -q $QUEUE -P $(project code $NAME) -l select=1:ncpus=1:mem=8g -l walltime=8:00:00

Next, in order for the conda environment to solve, you will need to set channel_priority to flexible in your conda configs as well. To read more about conda environments and their configuration, check out the documentation.

conda config --describe channel_priority # print your current conda settings
conda config --set channel_priority flexible # set to flexible if not already done

Then create the conda environment that contains the appropriate software, including nextflow, nf-core, and graphviz. The conda environment needs to be built only once; afterwards, simply use conda activate nextflow.

conda env create -f env/nextflow.yaml
conda activate nextflow

6 Examine the Sample Sheet

A sample sheet in csv (comma separated values) format is used as input to the pipeline. This sample sheet must have the following column names in any order:

  • “sample”
  • “sample_id”
  • “target_or_control”
  • “single_end”
  • “read1”
  • “read2”

Below is an example of a complete sample sheet for use in the pipeline, which can be edited for your own samples in test_data/test_dataset_sample_sheet.csv.

# dplyr and stringr are required below (they may already be loaded in Set-up)
library(dplyr)
library(stringr)

sample_sheet <- read.csv(here::here("test_data/test_dataset_sample_sheet.csv")) %>%
    mutate(single_end = "false") %>%
    mutate(sample = str_split_fixed(sample_id, pattern = "_", n = 3)[, 1]) %>%
    select(sample, everything())

head(sample_sheet)
# dim(sample_sheet)
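
For orientation, a minimal hand-written sheet with the required columns might look like the following. The sample IDs, target/control labels, and fastq paths are illustrative, not taken from the repository:

```shell
# Write a minimal example sample sheet (all names and paths are hypothetical).
cat > example_sample_sheet.csv <<'EOF'
sample,sample_id,target_or_control,single_end,read1,read2
M1,M1_H3K27_NK_1,target,false,fastqs/M1_H3K27_NK_1_R1.fastq.gz,fastqs/M1_H3K27_NK_1_R2.fastq.gz
M1,M1_IgG_NK_1,control,false,fastqs/M1_IgG_NK_1_R1.fastq.gz,fastqs/M1_IgG_NK_1_R2.fastq.gz
EOF
head -n 1 example_sample_sheet.csv
```

Note that each target sample is paired with an IgG control row sharing the same sample value.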

7 Run the Example Data

To ensure that the pipeline works, first run the test data set. This example will run using the data listed in test_data/test_dataset_sample_sheet.csv.

./main_run.sh "test_dataset"

8 Configure Pipeline for Your Data

8.1 Configuration file

Open the configuration file nextflow.config and edit the necessary parameters for building the index and/or running the alignment and peak-calling steps.

## //working directory for temporary/intermediate files produced in the workflow processes
## workDir = "$HOME/temp"
## 
## //global parameters
## params {
##     // general options
##     sample_sheet                = "./test_data/test_dataset_sample_sheet.csv"
##     queue                       = 'sceaq'
##     project                     = '207f23bf-acb6-4835-8bfe-142436acb58c'
##     outdir                      = "./results/mouse"
##     peaks_outdir                = "${params.outdir}/peaks_calls"
##     publish_dir_mode            = 'copy'
## 
##     //Bowtie params
##     // description: fasta OR the bowtie index path are required. If only fasta provided, set build_index to true
##     build_index                 = false
##     fasta                       = '/gpfs/shared_data/Bowtie2/mm39.fa'
##     index                       = '/gpfs/shared_data/Bowtie2/mm39_index'
##     save_unaligned              = false
## 
## <...>

8.2 Global Params

Be sure to change the following lines for the global parameters:

  • sample_sheet
  • queue
  • project
  • outdir
  • peaks_outdir
## //global parameters
## params {
##     // general options
##     sample_sheet                = "./test_data/test_dataset_sample_sheet.csv"
##     queue                       = 'sceaq'
##     project                     = '207f23bf-acb6-4835-8bfe-142436acb58c'
##     outdir                      = "./results/mouse"
##     peaks_outdir                = "${params.outdir}/peaks_calls"
##     publish_dir_mode            = 'copy'

8.3 Genomic References

Additionally, determine whether you need a new Bowtie2 index to be built for the target genome and/or the spike-in genome. The pipeline requires either a fasta filepath OR a Bowtie2 index filepath. The same is required for the spike-in, with E. coli provided as the default.

E. coli is the default because its DNA is carried over from the Cut&Run library prep methodology and is expected to be present in all Cut&Run experiments, regardless of whether an exogenous spike-in such as yeast is used. Please see here for more information on spike-in normalization.
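
As background on how spike-in normalization is typically applied (this sketch is illustrative, not a command from the pipeline): coverage is scaled by a constant divided by the number of reads aligned to the spike-in genome, so samples with fewer spike-in reads receive a larger scale factor. The constant 10000 and the read count below are made-up values:

```shell
# Illustrative spike-in scale factor: a fixed constant over spike-in aligned reads.
spikein_reads=5000   # stand-in for the number of reads aligned to E. coli
scale_factor=$(awk -v n="$spikein_reads" 'BEGIN { printf "%.4f", 10000 / n }')
echo "$scale_factor"
```

The resulting factor would then multiply the bedgraph coverage values before peak calling.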

Change the following lines for alignment reference files when needed:

  • build_index
  • fasta
  • index
  • build_spike_index
  • spike_fasta
  • spike_index
##     //Bowtie params
##     // description: fasta OR the bowtie index path are required. If only fasta provided, set build_index to true
##     build_index                 = false
##     fasta                       = '/gpfs/shared_data/Bowtie2/mm39.fa'
##     index                       = '/gpfs/shared_data/Bowtie2/mm39_index'
##     save_unaligned              = false
## 
##     //spike-in params
##     // description: fasta OR index path are required. If only fasta provided, set build_spike_index to true
##     build_spike_index           = false
##     spike_fasta                 = '/gpfs/shared_data/Bowtie2/GCF_000005845.2_ASM584v2_genomic.fa'
##     spike_index                 = '/gpfs/shared_data/Bowtie2/ecoli_index'

8.4 Optional: MACS2

Finally, decide whether to run MACS2 peak calls alongside the SEACR peak calling algorithm (default = true). MACS2 will use the effective genome size value provided in the gsize parameter.

If you are using a non-model organism or simply don’t want to use the effective genome size reported in the literature or the MACS2 documentation, you can set run_khmer = true to calculate an effective genome size from the target genome fasta filepath and the read length.

  • gsize
  • run_macs2
  • run_khmer
##     //Effective genome size
##     gsize                       = 1870000000 //must be a plain integer, NOT scientific notation, for use with deeptools
##     <...>
##     //MACS2 and khmer params
##     run_macs2                   = true
##     run_khmer                   = false //if true, will override the value in gsize parameter
##     kmer_size                   = 150 //kmer_size is the read-length
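
Since the gsize comment warns against scientific notation for deeptools, here is a small, illustrative way to expand a literature value such as 1.87e9 into plain integer form:

```shell
# Expand a scientific-notation genome size into the plain integer deeptools expects.
gsize_sci="1.87e9"
gsize_int=$(awk -v g="$gsize_sci" 'BEGIN { printf "%.0f", g }')
echo "$gsize_int"
```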

8.5 Advanced Options

In the nextflow.config, you can define additional command line arguments for the scientific software under the process scope. You may use these advanced options to change the computational resources requested for different processes; the cpus and memory parameters can be updated to request more resources, such as additional CPUs or memory, when files are large. You may also use the process scope to edit the command line parameters for processes in the workflow. Please be aware that default command line parameters for the Bowtie2 processes are already provided for both the target and spike-in alignments, but they can be edited.

The most commonly modified and important process parameters are listed toward the top of the process scope in the nextflow.config file.

The command line parameters for SEACR and MACS2 often need to be adjusted across multiple re-runs when deciding on the appropriate peak-set to use. For example, the MACS2 broad and narrow peak calling parameters for different histone modifications can be modified using the ext.args parameter.

## // Computational resource allocation for the processes run in the workflow
## process {
##     //Bowtie2 aligner process specific parameters
##     withName: BOWTIE2_ALIGN {
##         cpus = { 2 * task.attempt }
##         memory = { 32.GB * task.attempt }
##         ext.args = '--local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700'
##         ext.args2 = ''      //command line arguments for `samtools sort`
##     }
##     //SEACR peak calling resources
##     withName: SEACR_CALLPEAK {
##         cpus = { 1 * task.attempt }
##         memory = { 16.GB * task.attempt }
##         ext.version = '1.4' //version 1.3 and 1.4 supported
##         ext.args = '--normalize norm --mode stringent --remove yes'
##         publishDir = [...]
## 
##     }
##     //MACS2 peak calling resources
##     withName: MACS2_CALLPEAK {
##         cpus = { 1 * task.attempt }
##         memory = { 16.GB * task.attempt }
##         ext.args = '-q 0.01 --keep-dup all'
##         publishDir = [...]
## 
##     }

SEACR can be set to either v1.4 or v1.3, which have notably different command line interfaces and changes in the methods for normalization to IgG; in addition, v1.4 can optionally remove peaks found in IgG. Please see here for the full changelog.

For SEACR v1.3, you will often need to change the normalization argument from “non” to “norm” depending on whether you’re using IgG normalization or spike-in normalization. The example below demonstrates how to change the command line params and version by editing ext.version and ext.args.

##     //SEACR peak calling resources
##     withName: SEACR_CALLPEAK {
##         cpus = { 1 * task.attempt }
##         memory = { 16.GB * task.attempt }
##         ext.version = '1.3' //version 1.3 and 1.4 supported
##         ext.args = 'norm stringent'
##         publishDir = [...]
##                     
##     }

9 Run Script

usethis::edit_file(here::here("main_run.sh"))
## • Edit '/active/taylor_s/people/jsmi26/RPDEV/cutandrun_nf/main_run.sh'

Decide on the NFX_PROFILE, which determines whether the processes run locally or through the PBS job scheduler on Cybertron, and whether singularity or docker containers are used.

  1. PBS_singularity [DEFAULT, recommended]

     • submits a PBS job that will run the processes in singularity containers on Cybertron
     • takes care of requesting the appropriate resources through PBS

  2. local_singularity

     • runs locally with singularity in an interactive session on Cybertron
     • requires that the appropriate computational resources be requested with qsub -I -q <queue_name> -P <project_code> -l select=1:ncpus=4:mem=32GB

Edit the script main_run.sh and change the values for the NFX_PROFILE variable if desired.

## #Options: 'local_singularity', 'PBS_singularity'
## NFX_PROFILE='PBS_singularity'

9.1 Alignment and Peak Calls

Edit the variables in the main_run.sh script for the entry-point of the workflow. The default NFX_ENTRY option, “align_call_peaks”, runs the full pipeline (QC, alignment, peak calling, coverage tracks).

## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='align_call_peaks'

If you already have aligned BAM files, see test_data/test_dataset_bams_sample_sheet.csv for an example of calling peaks only, using the entry call_peaks.

## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='call_peaks'

Then execute the main_run.sh script to complete the peak calling on the samples. Provide a short, descriptive prefix for the pipeline run.

./main_run.sh "my_analysis"

9.2 Optional: Build the Index and Exit Pipeline

You can also change the entry-point of the workflow by setting the NFX_ENTRY variable in the main_run.sh script to bowtie2_index_only. This runs only the Bowtie2 build process and exits upon completion of the index building step.

## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='bowtie2_index_only'
./main_run.sh "bowtie2_index"

10 Expected Outputs

Under the path provided for the “outdir” parameter in the nextflow config, you will find directories named for each of the modules. Let’s say params.outdir = "./results/mouse" and peaks_outdir = "${params.outdir}/peak_calls".

There will be the following file structure:

  • results/mouse/

    • bamtobedgraph/

    • bowtie2_align/

      • {sample_id}.bam
      • {sample_id}.bowtie2.log
    • deeptools_bamcoverage/

    • fastqc/

    • fastqc_trim/

    • multiqc/

    • peak_calls/

      • seacr_callpeak/
        • {sample_id}.[norm or non].[stringent or relaxed].bed
      • macs2_call_peak/
        • {sample_id}_peaks.[narrowPeak or broadPeak]
        • {sample_id}_peaks.xls
        • {sample_id}_summits.bed
    • picard_markduplicates/

    • samtools_index/

    • samtools_nsort/

    • samtools_sort/

    • spikein_align/

      • {sample_id}_spikein.bam
      • {sample_id}_spikein.bowtie2.log
    • trimgalore/

      • {sample_id}_1.gz_trimming_report.txt
      • {sample_id}_2.gz_trimming_report.txt
      • {sample_id}_val_1.fq.gz
      • {sample_id}_val_2.fq.gz
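
A quick way to survey the results is to count the peak calls per sample. The snippet below follows the directory layout above, but fabricates a stand-in SEACR bed file (with a made-up sample name and peak) so the command has something to count:

```shell
# Create a stand-in SEACR output file following the layout described above.
mkdir -p results/mouse/peak_calls/seacr_callpeak
printf 'chr1\t100\t500\t25\t18\tchr1:250-300\n' \
    > results/mouse/peak_calls/seacr_callpeak/M1_H3K27_NK_1.norm.stringent.bed

# Count the peak calls in every SEACR bed file.
wc -l results/mouse/peak_calls/seacr_callpeak/*.bed
```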

10.1 Pipeline Reports

In addition, there will be an HTML report with information on where the temp data is stored in the workDir path, and general run statistics such as resources utilized versus requested, which helps with optimization. It also provides information on how much walltime was used per sample, total CPU hours, etc.

The HTML file is found in the reports directory and will have the prefix defined on the command line when ./main_run.sh "my_analysis" was invoked; in this example it would be named “my_analysis_{DATE}.html”.

There will also be a detailed nextflow log file that is useful for debugging; in this example it would be named “my_analysis_{DATE}_nextflow.log”.

Finally, the pipeline will produce a DAG (directed acyclic graph), which describes the workflow channels (inputs) and the modules. The DAG image will be saved under the dag/ directory with the name “my_analysis_{DATE}_dag.pdf”.